Show the code
import pandas as pd
import numpy as np
import json
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)Course DS 250
[MIGUEL SMITH]
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
Using the most recent 2015 data, the data suggests that San Francisco airport has the worst delays due to weather and potential other factors generalized by a delay ratio based on delayed flights vs total flights. Additionally the worst month to travel by plane may be June.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
Upon looking at the data I found a few elements that did not help the data set. I replaced ‘n/a’,‘unknown’,’‘,-999 with NaN int the data set. There was also a number called 1500+ which cannot be used numerically so I changed it to a simple 1500. I also changed the ’num_of_delays_carrier’ column to a integer column.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 924 entries, 0 to 923
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 airport_code 924 non-null object
1 airport_name 868 non-null object
2 month 897 non-null object
3 year 901 non-null float64
4 num_of_flights_total 924 non-null int64
5 num_of_delays_carrier 924 non-null int64
6 num_of_delays_late_aircraft 884 non-null float64
7 num_of_delays_nas 924 non-null int64
8 num_of_delays_security 924 non-null int64
9 num_of_delays_weather 924 non-null int64
10 num_of_delays_total 924 non-null int64
11 minutes_delayed_carrier 872 non-null float64
12 minutes_delayed_late_aircraft 924 non-null int64
13 minutes_delayed_nas 876 non-null float64
14 minutes_delayed_security 924 non-null int64
15 minutes_delayed_weather 924 non-null int64
16 minutes_delayed_total 924 non-null int64
dtypes: float64(4), int64(10), object(3)
memory usage: 122.8+ KB
<bound method NDFrame.to_json of airport_code airport_name month \
0 ATL Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January
year num_of_flights_total num_of_delays_carrier \
0 2005.0 35048 1500
num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security \
0 NaN 4598 10
num_of_delays_weather num_of_delays_total minutes_delayed_carrier \
0 448 8355 116423.0
minutes_delayed_late_aircraft minutes_delayed_nas \
0 104415 207467.0
minutes_delayed_security minutes_delayed_weather minutes_delayed_total
0 297 36931 465533 >
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
I decided to make a ratio of delays to total flights written in the ‘delay_ratio’ column. I did it first for all years but decided to instead use data from 2015 to understand the most recent delays. Included is also the average time delayed in hours. Even though San Francisco doesn’t have the highest average of time delayed I argue that it is stil te worse airport because of the delay ratio.
# Include and execute your code here
df = df.replace(['n/a','unknown','',-999],np.nan)
df = df.replace(['1500+'],1500)
df['num_of_delays_carrier']=df['num_of_delays_carrier'].astype(int)
df_2015 = df[df['year'] == 2015]
monthly_summary = df_2015.groupby('airport_code')[['num_of_flights_total',
'num_of_delays_total',
'minutes_delayed_total']].sum()
monthly_summary['delay_ratio'] = monthly_summary['num_of_delays_total'] / monthly_summary['num_of_flights_total'] * 100
# minutes to hours
monthly_summary['hours_delayed_total'] = monthly_summary['minutes_delayed_total'] / 60
# delay in hours, did I do this right?
monthly_summary['avg_delay_hours'] = monthly_summary['hours_delayed_total'] / monthly_summary['num_of_delays_total']
monthly_summary = monthly_summary.sort_values('delay_ratio', ascending=False)
monthly_summary = monthly_summary.round(2)
monthly_summary.iloc[0:1, ].to_json<bound method NDFrame.to_json of num_of_flights_total num_of_delays_total \
airport_code
SFO 162136 35080
minutes_delayed_total delay_ratio hours_delayed_total \
airport_code
SFO 2213414 21.64 36890.23
avg_delay_hours
airport_code
SFO 1.05 >
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
As we can see in the chart provided, June could be the worst month to fly in. I decided to quantify this guess by flight ratio. I made sure to remove any rows wihtout a month value. To clarify, the choice to use 2015 data has the hope to display the most recent and therefore more likely circumstances for flights.
# Include and execute your code here, lambda stuff
df = df.replace(['n/a','unknown','',-999],np.nan)
df = df.replace(['1500+'],1500)
df['num_of_delays_carrier']=df['num_of_delays_carrier'].astype(int)
df_2015 = df[df['year'] == 2015]
df_2015withmonth = df_2015.dropna(subset=['month'])
monthly_summary = df_2015withmonth.groupby('month').agg({
'num_of_flights_total': 'sum',
'num_of_delays_total': 'sum',
'num_of_delays_total': lambda x: (x.sum() / df_2015['num_of_flights_total'].sum()) * 100
})
monthly_summary['delay_ratio'] = monthly_summary['num_of_delays_total'] / monthly_summary['num_of_flights_total'] * 100
monthly_summary = monthly_summary.sort_values('delay_ratio', ascending=False)
monthly_summary
chart=(ggplot(monthly_summary.reset_index(), aes(x='month', y='delay_ratio'))
+ geom_bar(stat='identity', fill='blue')
+ ggtitle('2015 Monthly Flight Delay Ratios')
+ xlab('Month')
+ ylab('Delay Ratio (%)')
)
chartAccording to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
b. 30% of all delayed flights in the Late-Arriving category are due to weather
c. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
I split the code up into a few parts and I’m hoping the database rewrites are easy to follow. This analysis shows that ‘less severe’ weather delays have occurance than ‘severe’ weather.
# Include and execute your code here
df = df.replace(['n/a','unknown','',-999],np.nan)
df = df.replace(['1500+'],1500)
df['num_of_delays_carrier'] = df['num_of_delays_carrier'].astype(int)
# Create dfweather and clean it
dfweather = df.copy()
dfweather = dfweather.dropna(subset=['month','num_of_delays_late_aircraft','minutes_delayed_nas','year','minutes_delayed_carrier','minutes_delayed_nas'])# Include and execute your code here
# Weather calculations
dfweather['weather_severe'] = dfweather['num_of_delays_weather']
late_aircraft_mean = dfweather['num_of_delays_late_aircraft'].mean()
dfweather['num_of_delays_late_aircraft'] = dfweather['num_of_delays_late_aircraft'].fillna(late_aircraft_mean)
dfweather['weather_late_aircraft'] = dfweather['num_of_delays_late_aircraft'] * 0.30# Include and execute your code here
# 3. NAS weather delays (40% Apr-Aug, 65% other months)
# NAS weather calculation
dfweather['weather_nas'] = dfweather.apply(
lambda x: x['num_of_delays_nas'] * 0.40 if x['month'] in [4, 5, 6, 7, 8]
else x['num_of_delays_nas'] * 0.65,
axis=1
)
# Total weather delays
dfweather['total_weather_delays'] = dfweather['weather_severe'] + dfweather['weather_late_aircraft'] + dfweather['weather_nas']
print(dfweather[['month', 'weather_severe', 'weather_late_aircraft', 'weather_nas', 'total_weather_delays']].head())
df.iloc[0:1, ].to_json month weather_severe weather_late_aircraft weather_nas \
1 January 233 278.4 607.75
3 January 306 676.5 3519.75
4 January 56 204.0 414.70
5 January 114 219.9 757.90
6 January 270 169.2 561.60
total_weather_delays
1 1119.15
3 4502.25
4 674.70
5 1091.80
6 1000.80
<bound method NDFrame.to_json of airport_code airport_name month \
0 ATL Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January
year num_of_flights_total num_of_delays_carrier \
0 2005.0 35048 1500
num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security \
0 NaN 4598 10
num_of_delays_weather num_of_delays_total minutes_delayed_carrier \
0 448 8355 116423.0
minutes_delayed_late_aircraft minutes_delayed_nas \
0 104415 207467.0
minutes_delayed_security minutes_delayed_weather minutes_delayed_total
0 297 36931 465533 >
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
We can see that SFO once again has pretty bad delays, now we can see how those delays are played by weather.
# Include and execute your code here
# Calculate the proportion of weather delays per airport
from lets_plot import *
LetsPlot.setup_html()
airport_sum = dfweather.groupby('airport_code').agg({
'total_weather_delays': 'sum',
'num_of_flights_total': 'sum'
}).reset_index()
airport_sum['weather_delay_ratio'] = (airport_sum['total_weather_delays'] / airport_sum['num_of_flights_total']) * 100
airport_sum = airport_sum.sort_values('weather_delay_ratio', ascending=True)
chart = (ggplot(data=airport_sum)
+ aes(x='weather_delay_ratio', y='airport_code')
+ geom_bar(stat='identity', fill='#87CEEB', width=0.7) # Use hex color code
+ ggtitle('Weather Delay Proportion by Airport')
+ xlab('Percentage of Flights Delayed by Weather')
+ ylab('Airport Code')
)
chart.show()Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
type your results and analysis here
#
## Read in libraries
#import pandas as pd
#import numpy as np
#from lets_plot import *
#LetsPlot.setup_html(isolated_frame=True)
#df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
#df['month'].value_counts()['n/a']
#df.isna().sum()
#pd.crosstab()# Read in libraries
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Read the data
df = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
# For this dataset, we'll use 'before1980' as the target column
X = df.drop(columns=['before1980'])
y = df['before1980']
# Split the data with specified parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=76)
# Calculate average of first 10 test values
average_first_10 = y_test[:10].mean()
print("Average of the first 10 values in testing y:", average_first_10)
# Calculate average of first 10 training X values for the selling price 'sprice'
average_first_10_sprice = X_train['sprice'][:10].mean()
print("Average of the first 10 values in training X for sprice:", average_first_10_sprice)Average of the first 10 values in testing y: 0.4
Average of the first 10 values in training X for sprice: 2131970.0